{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "<img src=\"../../img/ods_stickers.jpg\" />\n",
    "    \n",
    "## [mlcourse.ai](mlcourse.ai) – Open Machine Learning Course \n",
    "### <center> Author: Nikita Simonov (ODS Slack nick: simanjan)</center>\n",
    "    \n",
    "## <center> Prediction of airlines delay</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Feature and data explanation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT's monthly Air Travel Consumer Report, published about 30 days after the month's end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released.\n",
    "\n",
    "This version of the dataset was compiled from the Statistical Computing Statistical Graphics 2009 Data Expo and is also available [here](http://stat-computing.org/dataexpo/2009/the-data.html). We will consider flight data for 1987."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import all necessary libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load dataset. Change path if needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_data = pd.read_csv(\"../../data/1987.csv.bz2\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Variable descriptions\n",
    "\n",
    "      Name\tDescription    \n",
    "    - Year\t1987\n",
    "    - Month\t1-12\n",
    "    - DayofMonth\t1-31\n",
    "    - DayOfWeek\t1 (Monday) - 7 (Sunday)\n",
    "    - DepTime\tactual departure time (local, hhmm)\n",
    "    - CRSDepTime\tscheduled departure time (local, hhmm)\n",
    "    - ArrTime\tactual arrival time (local, hhmm)\n",
    "    - CRSArrTime\tscheduled arrival time (local, hhmm)\n",
    "    - UniqueCarrier\tunique carrier code\n",
    "    - FlightNum\tflight number\n",
    "    - TailNum\tplane tail number\n",
    "    - ActualElapsedTime\tin minutes\n",
    "    - CRSElapsedTime\tin minutes\n",
    "    - AirTime\tin minutes\n",
    "    - ArrDelay\tarrival delay, in minutes\n",
    "    - DepDelay\tdeparture delay, in minutes\n",
    "    - Origin\torigin IATA airport code\n",
    "    - Dest\tdestination IATA airport code\n",
    "    - Distance\tin miles\n",
    "    - TaxiIn\ttaxi in time, in minutes\n",
    "    - TaxiOut\ttaxi out time in minutes\n",
    "    - Cancelled\twas the flight cancelled?\n",
    "    - CancellationCode\treason for cancellation (A = carrier, B = weather, C = NAS, D = security)\n",
    "    - Diverted\t1 = yes, 0 = no\n",
    "    - CarrierDelay\tin minutes\n",
    "    - WeatherDelay\tin minutes\n",
    "    - NASDelay\tin minutes\n",
    "    - SecurityDelay\tin minutes\n",
    "    - LateAircraftDelay\tin minutes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The target feuture is a 'cancelled'."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Primary data analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_data.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_data.groupby('Cancelled').size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.crosstab(raw_data['Month'], raw_data['DayOfWeek'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_data.groupby(['UniqueCarrier','FlightNum'])['Distance'].sum().sort_values(ascending=False).iloc[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Number of all flights by days of week and months:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Primary visual data analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Unique carrier plot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_data.groupby('UniqueCarrier').size().plot(kind='bar');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Top five of the largest flights by a total distance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataset contains a data for only three month.\n",
    "\n",
    "Now let's look on the cancelled flights by dayOfweek, dayOfMonth and month."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_data[raw_data['Cancelled'] == 1].groupby(['DayOfWeek']).size().plot(kind='bar');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_data[raw_data['Cancelled'] == 1].groupby(['DayofMonth']).size().plot(kind='bar');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_data[raw_data['Cancelled'] == 1].groupby(['Month']).size().plot(kind='bar');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Cancelled flights by distance. Here need normalize data for best representation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize = (12,6))\n",
    "ax.hist([raw_data['Distance'], raw_data[raw_data['Cancelled'] == 1]['Distance']], \n",
    "        normed=True, label=['All', 'Cancelled'])\n",
    "\n",
    "ax.set_xlim(0,3000)\n",
    "ax.set_xlabel('Distance')\n",
    "ax.set_title('Histogram of Flight Distances')\n",
    "\n",
    "plt.legend()\n",
    "plt.show();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Cancelled flights plot UniquerCarrier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_data[raw_data['Cancelled'] == 1].groupby('UniqueCarrier').size().plot(kind='bar');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look on the top five cancelled flights by Origin and Dest columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_data[raw_data['Cancelled'] == 1].groupby(['Origin', 'Cancelled']).size()\\\n",
    ".sort_values(ascending=False).iloc[:5].plot(kind='bar');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_data[raw_data['Cancelled'] == 1].groupby(['Dest', 'Cancelled']).size()\\\n",
    ".sort_values(ascending=False).iloc[:5].plot(kind='bar');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize = (12,6))\n",
    "\n",
    "ax.hist([raw_data['CRSDepTime'], raw_data[raw_data['Cancelled']== 1]['CRSDepTime']], normed=True,\n",
    "        label=['All', 'Cancelled'])\n",
    "\n",
    "ax.set_xlabel('Scheduled Departure Time')\n",
    "ax.set_title('Histogram of Scheduled Departure Times')\n",
    "\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Box plot for Distance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize = (15,6))\n",
    "sns.boxplot(raw_data['Distance'], ax=ax);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The distance column can be scaled for better result."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Building a heatmap for a correlations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(8,8))\n",
    "sns.heatmap(raw_data.corr(), ax=ax);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. Insights and found dependencies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The plots show that there are missing data, as well as some signs correlate with each other."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5. Metrics selection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we have the task of binary 0 or 1 classification as a metric, we can choose:\n",
    "    - Accuracy score.\n",
    "    - Recall score.\n",
    "    - F1 score.\n",
    "    - Precision score."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6. Model selection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data contain both binary and categorical features. Therefore, as a model,  can choose both a logical regression and a gradient boosting model. We let's see both."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's try to building a model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7. Data preprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Dropping columns with all NaN values like a 'TailNum' or other."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_data.drop(['TailNum', 'AirTime', 'TaxiIn', 'TaxiOut'], axis = 1, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Dropping all delayes columns and also cancellationCode."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_data.drop(['CancellationCode', 'CarrierDelay',\n",
    "               'WeatherDelay', 'NASDelay','SecurityDelay', 'LateAircraftDelay'], axis=1, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All columns associated with cancelled flights have a MaN in the some columns, like a 'Carrier Delay', or other delays. Filing it by zero."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_data.fillna(0, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_data.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Separate the target variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_data_target = raw_data['Cancelled']\n",
    "raw_data.drop('Cancelled', axis=1, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Replace Unique Carriers by index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "unique_carrier_list = raw_data['UniqueCarrier'].value_counts().index.tolist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_data['UniqueCarrier'] = raw_data['UniqueCarrier'].apply(lambda x: int(unique_carrier_list.index(x)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Do the same with Origin and Dest."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "origin_list = raw_data['Origin'].value_counts().index.tolist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_data['Origin'] = raw_data['Origin'].apply(lambda x : int(origin_list.index(x)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dest_list = raw_data['Dest'].value_counts().index.tolist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_data['Dest'] = raw_data['Dest'].apply(lambda x : int(dest_list.index(x)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Splitting a dataset and training a Logistic Regression on the dataset.\n",
    "\n",
    "Now let's splitting the data for two parts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(raw_data, raw_data_target, test_size=0.33, \n",
    "                                                    shuffle=True, random_state=17)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 8. Cross-validation and adjustment of model hyperparameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from catboost import CatBoostClassifier\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import StratifiedKFold"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "accuracy_score_list = []\n",
    "recall_score_list = []\n",
    "f1_score_list = []\n",
    "precision_score_list = []\n",
    "\n",
    "for train_index,test_index in skf.split(raw_data,raw_data_target):\n",
    "    xtr,xvl = raw_data.loc[train_index], raw_data.loc[test_index]\n",
    "    ytr,yvl = raw_data_target.loc[train_index], raw_data_target.loc[test_index]\n",
    "    \n",
    "    log_reg = LogisticRegression()\n",
    "    log_reg.fit(xtr,ytr)\n",
    "    predicted = log_reg.predict(xvl)\n",
    "    \n",
    "    accuracy_score_list.append(accuracy_score(predicted, yvl))\n",
    "    recall_score_list.append(recall_score(predicted, yvl))\n",
    "    f1_score_list.append(f1_score(predicted, yvl))\n",
    "    precision_score_list.append(accuracy_score(predicted, yvl))\n",
    "\n",
    "    \n",
    "print('Accuracy score mean:{}'.format(np.mean(accuracy_score_list)))\n",
    "print('Recall score mean:{}'.format(np.mean(recall_score_list)))\n",
    "print('F1 score mean:{}'.format(np.mean(f1_score_list)))\n",
    "print('Precision score mean:{}'.format(np.mean(precision_score_list)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looking at the metrics one can conclude that the target variable in the data has a class imbalance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Try a CatBoostClassifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat_boost = CatBoostClassifier(random_seed=17, iterations=10)\n",
    "cat_boost.fit(X_train, y_train, verbose=False, plot=True);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 9.  Creation of new features and description of this process"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a new feature 'DepTimeHour' - departure time by hour. Departure time in the night hours can be cause of cancelled flight. Also creating a new features 'Night', 'Morning', 'Afternoon', 'Evening'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_data['DepTimeHour'] = raw_data['CRSDepTime'].apply(lambda x: round(x / 100))\n",
    "\n",
    "raw_data['Night'] = raw_data['DepTimeHour'].apply(lambda x: int(7 >= x >= 0))\n",
    "raw_data['Morning'] = raw_data['DepTimeHour'].apply(lambda x:  int(12 >= x > 7))\n",
    "raw_data['Afternoon'] = raw_data['DepTime'].apply(lambda x: int(18 >= x > 12))\n",
    "raw_data['Evening'] = raw_data['DepTime'].apply(lambda x: int(23 >= x > 18))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 10. Plotting training and validation curves"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import precision_recall_curve, roc_curve, auc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "precision, recall, thresholds = precision_recall_curve(y_test, cat_boost.predict(X_test))\n",
    "thresholds_min = np.argmin(np.abs(thresholds))\n",
    "closest_zero_p = precision[thresholds_min]\n",
    "closest_zero_r = recall[thresholds_min]\n",
    "\n",
    "fig, ax= plt.subplots(figsize=(8,8))\n",
    "ax.plot(precision, recall, label='Precision-Recall Curve')\n",
    "ax.plot(closest_zero_p, closest_zero_r)\n",
    "ax.set_xlabel('Precision')\n",
    "ax.set_ylabel('Recall')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 11. Prediction for test samples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Accuracy of LogisticRegression :{}'.format(accuracy_score(log_reg.predict(X_test), y_test)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Accuracy of Catboost Classifier:{}'.format(accuracy_score(cat_boost.predict(X_test), y_test)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 12. Conclusions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The percentage of  canceled flights by total."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_data_target.value_counts()[1] / raw_data_target.value_counts()[0] * 100"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The models did a great job predicting flight cancellation. After preprocessing the data, the models proved to be even too good. The reason for this was the class imbalance. \n",
    "\n",
    "The target variable cancelled = 1 is only 1.5 percent by total. In the gradient boosting model, we always predicted the target variable with one hundred percent accuracy, even without fine tuning the model."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
